fix: correct CPU usage graph pinned at 100% #14

Merged
TerrifiedBug merged 1 commit into main from
fix/cpu-metrics-calculation
Mar 5, 2026

@TerrifiedBug
Owner

Summary

  • CPU usage graph was always pinned at 100% due to incorrect calculation
  • Vector's host_cpu_seconds_total is a per-core, per-mode counter. Summing all modes (including idle) across all cores makes the delta roughly core count × elapsed wall-clock seconds, so (delta / dt) * 100 was always at or above 100% and got clamped
  • Fix: track idle CPU seconds separately and compute CPU% = (total - idle) / total * 100
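
To make the arithmetic concrete, here is a minimal Go sketch comparing the two formulas on a hypothetical 4-core host that is 25% busy over a 10-second window. It is not the project's code (the chart lives in the TypeScript frontend); the function names `oldCPUPercent`, `newCPUPercent`, and `clamp`, the sample numbers, and the choice to return 0 for a non-positive delta are all assumptions made for illustration.

```go
package main

import "fmt"

// clamp restricts a percentage to the range [0, 100].
func clamp(v float64) float64 {
	if v < 0 {
		return 0
	}
	if v > 100 {
		return 100
	}
	return v
}

// oldCPUPercent mirrors the previous approach: the delta of the summed
// counter divided by elapsed wall-clock seconds.
func oldCPUPercent(totalDelta, dtSeconds float64) float64 {
	return clamp((totalDelta / dtSeconds) * 100)
}

// newCPUPercent is the ratio-based formula from this PR. Returning 0 when
// totalDelta is not positive is an assumption of this sketch.
func newCPUPercent(totalDelta, idleDelta float64) float64 {
	if totalDelta <= 0 {
		return 0
	}
	return clamp((totalDelta - idleDelta) / totalDelta * 100)
}

func main() {
	// Hypothetical 4-core host over 10 s: 40 CPU-seconds in total, 30 of them idle.
	const totalDelta, idleDelta, dt = 40.0, 30.0, 10.0
	fmt.Printf("old: %.0f%%\n", oldCPUPercent(totalDelta, dt))        // old: 100%
	fmt.Printf("new: %.0f%%\n", newCPUPercent(totalDelta, idleDelta)) // new: 25%
}
```

The old formula yields 400% before clamping because the summed counter advances by one second per core per wall-clock second, while the new ratio does not depend on the core count.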

Changes

Full-stack fix across 7 files:

  • Agent scraper: parse mode label from host_cpu_seconds_total, sum idle+iowait into new CpuSecondsIdle field
  • Agent structs + heartbeat: add CpuSecondsIdle to HostMetrics struct and heartbeat builder
  • Server: accept cpuSecondsIdle in heartbeat Zod schema, store in DB
  • Prisma: add cpuSecondsIdle Float @default(0) column to NodeMetric + migration
  • Frontend: replace (cpuDelta / dtSeconds) * 100 with ((totalDelta - idleDelta) / totalDelta) * 100
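
The scraper change in the first bullet can be sketched roughly as follows. This is a simplified illustration, not the contents of scraper.go: `scrapeCPU` and `modeLabel` are hypothetical helpers, the trimmed-down `HostMetrics` struct carries only the two CPU fields, and the naive label split assumes label values contain no commas. A real scraper would more likely use a proper Prometheus text-format parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// HostMetrics holds only the two CPU counters discussed in this PR.
type HostMetrics struct {
	CpuSecondsTotal float64
	CpuSecondsIdle  float64
}

const cpuMetricPrefix = "host_cpu_seconds_total{"

// modeLabel extracts the value of the mode label from a label set such as
// `cpu="0",mode="idle"`. It assumes label values contain no commas.
func modeLabel(labels string) string {
	for _, pair := range strings.Split(labels, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if ok && k == "mode" {
			return strings.Trim(v, `"`)
		}
	}
	return ""
}

// scrapeCPU sums every host_cpu_seconds_total sample into CpuSecondsTotal and
// additionally sums the idle and iowait modes into CpuSecondsIdle.
func scrapeCPU(exposition string) HostMetrics {
	var m HostMetrics
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, cpuMetricPrefix) {
			continue
		}
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		m.CpuSecondsTotal += value
		mode := modeLabel(line[len(cpuMetricPrefix):end])
		if mode == "idle" || mode == "iowait" {
			m.CpuSecondsIdle += value
		}
	}
	return m
}

func main() {
	sample := `# TYPE host_cpu_seconds_total counter
host_cpu_seconds_total{cpu="0",mode="idle"} 70
host_cpu_seconds_total{cpu="0",mode="iowait"} 5
host_cpu_seconds_total{cpu="0",mode="user"} 20
host_cpu_seconds_total{cpu="0",mode="system"} 5
`
	m := scrapeCPU(sample)
	fmt.Println(m.CpuSecondsTotal, m.CpuSecondsIdle) // 100 75
}
```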

Test plan

  • Deploy updated agent and server
  • Verify CPU graph shows realistic utilization (not pinned at 100%)
  • Verify existing metrics (memory, disk, network) are unaffected
  • Confirm backward compat: old agents without cpuSecondsIdle default to 0 (CPU shows 100% until agent updates, same as before)
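
The backward-compatibility case in the last bullet can be checked with a small sketch. It assumes stored rows expose cumulative total and idle counters and that legacy rows carry an idle value of 0; the `sample` type and `cpuSeries` function are hypothetical stand-ins for the frontend logic, written in Go only for illustration.

```go
package main

import "fmt"

// sample is one stored metrics row with cumulative CPU counters.
type sample struct {
	Total, Idle float64
}

// cpuSeries computes one CPU% point per consecutive pair of rows, clamped to
// [0, 100]. Emitting 0 for a non-positive total delta is an assumption.
func cpuSeries(rows []sample) []float64 {
	out := make([]float64, 0, len(rows))
	for i := 1; i < len(rows); i++ {
		totalDelta := rows[i].Total - rows[i-1].Total
		idleDelta := rows[i].Idle - rows[i-1].Idle
		if totalDelta <= 0 {
			out = append(out, 0)
			continue
		}
		pct := (totalDelta - idleDelta) / totalDelta * 100
		if pct < 0 {
			pct = 0
		} else if pct > 100 {
			pct = 100
		}
		out = append(out, pct)
	}
	return out
}

func main() {
	legacy := []sample{{100, 0}, {140, 0}}    // old agent: idle always 0
	updated := []sample{{100, 60}, {140, 90}} // new agent: idle reported
	fmt.Println(cpuSeries(legacy), cpuSeries(updated)) // [100] [25]
}
```

Under these assumptions, the single interval that spans an agent upgrade (idle jumping from 0 to a large cumulative value) produces a negative ratio that clamps to 0%, so one low data point at the upgrade boundary would be expected.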

The CPU graph was pinned at 100% because host_cpu_seconds_total from
Vector is a per-core, per-mode counter. Summing all modes (including
idle) across all cores makes the delta roughly core count × elapsed
wall-clock time, so (delta/dt)*100 was always >=100% and got clamped.

Fix: track idle CPU seconds separately and compute utilization as
(total - idle) / total * 100, which is core-count independent and
gives accurate whole-server CPU utilization.

Changes across the full stack:
- Agent scraper: filter by mode label, sum idle+iowait separately
- Agent structs/heartbeat: add CpuSecondsIdle field
- Server heartbeat route: accept and store cpuSecondsIdle
- Prisma schema + migration: add cpuSecondsIdle column
- Fleet router: return new field
- Frontend chart: new formula using idle delta
@greptile-apps

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR fixes a CPU usage graph that was permanently pinned at 100% by introducing a separate cpuSecondsIdle counter (idle + iowait modes) throughout the full stack — agent scraper, Go structs, heartbeat payload, Zod schema, Prisma model + migration, tRPC select, and frontend chart — and replacing the old wall-clock-time formula with (totalDelta - idleDelta) / totalDelta * 100.

  • Root cause fix is correct: the old approach divided accumulated CPU-seconds (all cores × all modes) by elapsed wall-clock seconds, which always exceeded 1.0 × 100% for multi-core hosts. The new ratio-based formula is the standard way to compute CPU utilization from host_cpu_seconds_total counters.
  • iowait is grouped into the idle bucket (scraper.go lines 150-152). This is a semantic trade-off: I/O-bound workloads will report lower CPU%, which could mask disk pressure. Consider whether surfacing iowait separately or documenting the choice is warranted.
  • Backward compatibility is handled correctly: old agents without cpuSecondsIdle send 0, the DB column defaults to 0, and the frontend clamps the result — reproducing the prior 100% display until agents update, as documented in the PR.
  • Migration is safe: NOT NULL DEFAULT 0 on a DOUBLE PRECISION column is a non-destructive, backward-compatible change for existing rows.
  • Minor indentation error in heartbeat.go: the new CpuSecondsIdle field is indented one tab level shallower than all surrounding fields; gofmt would flag this.

Confidence Score: 4/5

  • Safe to merge — the core formula change is mathematically correct and the full-stack propagation is consistent.
  • The fix correctly addresses the root cause, all layers are updated consistently, backward compatibility is maintained, and the migration is non-destructive. Score is 4 rather than 5 only because of the iowait semantic ambiguity and the minor gofmt indentation issue in heartbeat.go.
  • agent/internal/metrics/scraper.go (iowait classification) and agent/internal/agent/heartbeat.go (indentation)

Important Files Changed

| Filename | Overview |
| --- | --- |
| agent/internal/metrics/scraper.go | Correctly accumulates CpuSecondsIdle from idle+iowait mode labels; minor semantic concern about iowait classification. |
| agent/internal/agent/heartbeat.go | Correctly wires CpuSecondsIdle into the heartbeat payload; new line has a tab-level indentation error. |
| src/app/api/agent/heartbeat/route.ts | cpuSecondsIdle added to Zod schema as optional and persisted with correct null-default fallback; no auth or validation regressions. |
| src/components/fleet/node-metrics-charts.tsx | CPU% formula correctly changed to (totalDelta - idleDelta) / totalDelta; clamp guards and the i>0 check are preserved. |
| prisma/migrations/20260305100000_add_cpu_seconds_idle/migration.sql | Non-destructive ALTER TABLE with NOT NULL DEFAULT 0; backward-compatible for existing rows. |

Sequence Diagram

sequenceDiagram
    participant V as Vector (Prometheus)
    participant S as scraper.go
    participant H as heartbeat.go
    participant API as /api/agent/heartbeat
    participant DB as PostgreSQL (NodeMetric)
    participant FE as node-metrics-charts.tsx

    V->>S: host_cpu_seconds_total{mode="idle"} += x
    V->>S: host_cpu_seconds_total{mode="iowait"} += y
    V->>S: host_cpu_seconds_total{mode="user"} += z
    note over S: CpuSecondsTotal += all modes<br/>CpuSecondsIdle += idle + iowait
    S->>H: HostMetrics{CpuSecondsTotal, CpuSecondsIdle}
    H->>API: POST heartbeat {cpuSecondsTotal, cpuSecondsIdle}
    API->>API: Zod validate (both optional)
    API->>DB: NodeMetric.create({cpuSecondsTotal, cpuSecondsIdle})
    FE->>DB: fleet.nodeMetrics query (cpuSecondsTotal, cpuSecondsIdle selected)
    DB-->>FE: time-series rows
    note over FE: cpuPercent = (totalDelta - idleDelta) / totalDelta * 100<br/>clamped to [0, 100]

Last reviewed commit: 1663a49

MemoryUsedBytes: sr.Host.MemoryUsedBytes,
MemoryFreeBytes: sr.Host.MemoryFreeBytes,
CpuSecondsTotal: sr.Host.CpuSecondsTotal,
CpuSecondsIdle: sr.Host.CpuSecondsIdle,

Misaligned indentation on new field

The CpuSecondsIdle line uses one fewer tab than every other field in the same struct literal. Running gofmt would flag this. While it compiles fine, it breaks visual alignment and will cause noisy diffs in future edits.

Suggested change (agent/internal/agent/heartbeat.go, line 63):

```suggestion
				CpuSecondsIdle:   sr.Host.CpuSecondsIdle,
```

Comment on lines +150 to +152
if mode == "idle" || mode == "iowait" {
	sr.Host.CpuSecondsIdle += value
}

Consider excluding iowait from the idle bucket

Classifying iowait as "idle" means I/O-bound workloads will show artificially low CPU utilization on the graph. For example, a system that's 80% blocked on disk reads will report ~20% CPU busy even though it's clearly under stress.

Standard tools like iostat and htop report iowait as a separate category precisely to make I/O pressure visible. The field is also named CpuSecondsIdle, which implies pure idle time.

If the intent is "CPU not doing compute work", renaming the field to CpuSecondsNonBusy (and documenting that it includes iowait) would at least make the semantics explicit. Alternatively, tracking idle only and displaying iowait as a separate series in the chart gives users richer diagnostic information.
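
The second alternative (tracking idle only and surfacing iowait as its own series) might look roughly like the sketch below. It is not part of the merged change: the `cpuCounters` type, its `Iowait` field, and the `percentages` helper are hypothetical names introduced purely to illustrate the idea.

```go
package main

import "fmt"

// cpuCounters keeps idle and iowait in separate buckets (hypothetical layout).
type cpuCounters struct {
	Total, Idle, Iowait float64
}

// add accumulates one host_cpu_seconds_total sample by its mode label.
func (c *cpuCounters) add(mode string, value float64) {
	c.Total += value
	switch mode {
	case "idle":
		c.Idle += value
	case "iowait":
		c.Iowait += value
	}
}

// percentages returns busy% and iowait% between two snapshots. Busy excludes
// both idle and iowait, so the chart could plot the two series side by side.
func percentages(prev, cur cpuCounters) (busy, iowait float64) {
	total := cur.Total - prev.Total
	if total <= 0 {
		return 0, 0
	}
	idle := cur.Idle - prev.Idle
	wait := cur.Iowait - prev.Iowait
	return (total - idle - wait) / total * 100, wait / total * 100
}

func main() {
	var prev, cur cpuCounters
	// The example from the comment above: a window that is 80% blocked on
	// disk reads and 20% doing compute.
	cur.add("iowait", 80)
	cur.add("user", 20)
	busy, iowait := percentages(prev, cur)
	fmt.Printf("busy=%.0f%% iowait=%.0f%%\n", busy, iowait) // busy=20% iowait=80%
}
```

The busy figure is the same ~20% the current grouping would report, but the separate iowait series makes the disk pressure visible rather than folding it into idle time.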

@TerrifiedBug TerrifiedBug merged commit e559235 into main Mar 5, 2026
1 check passed
@TerrifiedBug TerrifiedBug deleted the fix/cpu-metrics-calculation branch March 5, 2026 17:51